124 research outputs found

    Designing Virtualization-Based Infrastructure for Providing On-demand HPC Software-based Service

    The emergence of Internet-based computing, namely cloud computing, has increased the possibility of sharing remote resources. On the other hand, while the usefulness of grid computing for sharing hardware resources has been well studied, no comparable studies seem to be available concerning software. In this paper we propose an on-demand service model for sharing software within grids. Targeting performance, flexibility and resource-use efficiency, we discuss virtualization-enabled features to suggest an infrastructure model and an implementation approach based on service execution within virtual machines (VMs). The study shows that, in spite of the potential performance overhead on which conventional wisdom has based its unsuitability for HPC, virtualization can enable relevant features for on-demand cloud-like services, such as the ability to build flexible, highly reconfigurable infrastructures and to easily achieve efficiency-driven resource scheduling. Based on an OpenNebula infrastructure with Xen as the VM monitor backend, we implement a prototype and carry out performance experiments using the PARSEC benchmark. Our results show that, with suitable tuning, VMs can achieve near-native performance, even when applications run on concurrent VMs hosted on a single node.

    Interactive Verification of Runtime Properties of a Program with a Debugger

    Monitoring is the study of a system during its execution by observing the events that enter and leave it, in order to discover, verify, or enforce properties at runtime. Debugging is the study of a system during its execution in order to find and understand its malfunctions so as to correct them, by interactively inspecting its internal state. In this paper, we combine monitoring and debugging by defining an efficient and practical way to automatically verify properties during the execution of a program using a debugger, in order to help detect anomalies in its code while preserving the interactive character of classical debugging.
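The combination described above can be sketched in miniature with Python's tracing hook, which plays the role of a scriptable debugger: a user-supplied property is checked at every line event, and violations are recorded rather than passing silently. This is only an illustrative sketch with assumed names (`make_property_checker`, the toy `buggy_countdown`), not the tool from the paper.

```python
import sys

def make_property_checker(prop, on_violation):
    """Trace function that checks a user-supplied property at each line event,
    similar in spirit to a conditional breakpoint in a debugger."""
    def tracer(frame, event, arg):
        if event == "line" and not prop(frame.f_locals):
            on_violation(frame.f_lineno, dict(frame.f_locals))
        return tracer
    return tracer

violations = []

def buggy_countdown(n):
    while n != 0:   # intended property: n stays non-negative
        n -= 3      # bug: may step past zero and go negative
        if n < -5:  # stop the demo before looping forever
            break
    return n

# Property to enforce: whenever 'n' is bound, it must be >= 0.
prop = lambda local_vars: local_vars.get("n", 0) >= 0
sys.settrace(make_property_checker(prop, lambda line, env: violations.append(env["n"])))
result = buggy_countdown(10)
sys.settrace(None)

print(violations[0])  # first violating value of n observed by the checker
```

The tracing stays interactive in spirit: instead of appending to a list, the violation callback could drop into `pdb`, which is closer to the debugger-based workflow the abstract describes.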

    Density Functional Theory calculation on many-cores hybrid CPU-GPU architectures

    The implementation of a full electronic structure calculation code on a hybrid parallel architecture with Graphics Processing Units (GPUs) is presented. The code underlying our implementation is a GNU GPL code based on Daubechies wavelets. It shows very good performance, systematic convergence properties and excellent efficiency on parallel computers. Our GPU-based acceleration fully preserves all these properties. In particular, the code is able to run on many cores which may or may not have an associated GPU, and is thus able to run in massively parallel hybrid environments, even with a non-homogeneous CPU/GPU ratio. With double-precision calculations, we achieve considerable speedups: a factor of 20 for some operations and a factor of 6 for the whole DFT code.
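The gap between the per-operation and whole-code speedups quoted above is what Amdahl's law predicts when only part of the runtime is accelerated. As an illustration (the 90% accelerated fraction below is an assumption chosen for the example, not a figure from the paper):

```python
def overall_speedup(accelerated_fraction, kernel_speedup):
    """Amdahl's law: overall speedup when only a fraction of the
    runtime benefits from acceleration."""
    return 1.0 / ((1.0 - accelerated_fraction)
                  + accelerated_fraction / kernel_speedup)

# If ~90% of the runtime sits in operations accelerated 20x on the GPU,
# the whole-code speedup is bounded near a factor of 7.
s = overall_speedup(0.90, 20.0)
print(round(s, 2))
```

The remaining unaccelerated 10% dominates, which is why a 20x kernel gain translates to only about a 6-7x gain for the full code.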

    MAi: Memory Affinity Interface

    In this document, we describe an interface called MAi (Memory Affinity interface), which allows developers to manage memory affinity on NUMA architectures. The affinity unit in MAi is an array of the parallel application, and a set of memory policies implemented in MAi can be applied to these arrays in a simple way. High-level functions implemented in MAi minimize the developer's work when managing memory affinity on NUMA machines. MAi's performance has been evaluated on two different NUMA machines using several parallel applications. The results obtained with MAi show significant gains compared with standard memory affinity solutions.

    Design methodology for workload-aware loop scheduling strategies based on genetic algorithm and simulation

    In high-performance computing, the application's workload must be evenly balanced among threads to deliver cutting-edge performance and scalability. In OpenMP, the load balancing problem arises when scheduling loop iterations to threads. In this context, several scheduling strategies have been proposed, but they do not take the input workload of the application into account and thus turn out to be suboptimal. In this work, we introduce a design methodology to propose, study, and assess the performance of workload-aware loop scheduling strategies. In this methodology, a genetic algorithm is employed to explore the solution space of the problem and to guide the design of new loop scheduling strategies, and a simulator is used to evaluate their performance. As a proof of concept, we show how the proposed methodology was used to propose and study a new workload-aware loop scheduling strategy named smart round-robin (SRR). We implemented this strategy in the GNU Compiler Collection's OpenMP runtime and carried out several experiments to validate the simulator and to evaluate the performance of SRR. Our experimental results show that SRR may deliver up to 37.89% and 14.10% better performance than OpenMP's dynamic loop scheduling strategy in the simulated environment and in a real-world application kernel, respectively.
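The intuition behind a workload-aware strategy can be sketched in a few lines of simulation, in the spirit of the simulator the methodology relies on. The sketch assumes, for illustration only, that a smart round-robin variant sorts iteration loads before dealing them out round-robin; the load values and thread count are made up, and this is not GCC's actual runtime code.

```python
def makespan(assignment):
    """Parallel loop time = load of the most loaded thread."""
    return max(sum(loads) for loads in assignment)

def round_robin(loads, nthreads):
    """Plain round-robin: iteration i goes to thread i mod nthreads."""
    bins = [[] for _ in range(nthreads)]
    for i, w in enumerate(loads):
        bins[i % nthreads].append(w)
    return bins

def smart_round_robin(loads, nthreads):
    """Illustrative workload-aware variant: sort iterations by decreasing
    load first, so heavy iterations spread across threads."""
    return round_robin(sorted(loads, reverse=True), nthreads)

# Skewed synthetic workload: a few heavy iterations among many light ones.
loads = [9, 1, 8, 1, 7, 1, 1, 1]
print(makespan(round_robin(loads, 2)),
      makespan(smart_round_robin(loads, 2)))
```

With this skewed input, plain round-robin happens to stack the heavy iterations on one thread, while the sorted variant spreads them, shrinking the makespan; that sensitivity to the input workload is exactly what the abstract's methodology is designed to study.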

    Minas: Memory Affinity Management Framework

    In this document, we introduce Minas, a memory affinity management framework for cache-coherent NUMA (Non-Uniform Memory Access) platforms, which provides either explicit memory affinity management or an automatic one, with efficiency and architecture abstraction for numerical scientific applications. The explicit tuning is based on an API called MAi (Memory Affinity interface), which provides simple functions to manage allocation and data placement using an extensive set of memory policies. An automatic tuning mechanism is provided by a preprocessor named MApp (Memory Affinity preprocessor). MApp analyses both the application source code and the characteristics of the target cache-coherent NUMA platform in order to apply MAi functions automatically at compile time. Minas's efficiency and architecture abstraction have been evaluated on two cache-coherent NUMA platforms using three numerical scientific HPC applications. The results show significant gains compared with other solutions available on Linux (first-touch, libnuma and numactl).

    Autonomic Parallelism and Thread Mapping Control on Software Transactional Memory

    Parallel programs need to manage the trade-off between the time spent in synchronization and computation. This trade-off is significantly affected by the number of active threads: high parallelism may decrease computing time while increasing synchronization cost. Furthermore, the locality of threads on different cores may also impact program performance, as memory access time can vary from one core to another due to the complexity of the underlying memory architecture. Therefore, the performance of a program can be improved by adjusting the number of active threads as well as the mapping of its threads to physical cores. However, there is no universal rule for deciding the parallelism and thread locality of a program from an offline view, and offline tuning is error-prone. In this paper, we dynamically manage parallelism and thread locality. We address multi-threading problems via Software Transactional Memory (STM), which has emerged as a promising technique that bypasses locks to address synchronization issues through transactions. Autonomic computing offers designers a framework of methods and techniques to build autonomic systems with well-mastered behaviours. Its key idea is to implement feedback control loops to design safe, efficient and predictable controllers, which enable monitoring and adjusting controlled systems dynamically while keeping overhead low. We propose to design a feedback control loop to automate thread management at runtime and diminish program execution time.
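A minimal sketch of such a feedback control loop, with assumed names and thresholds (`control_step` and the 0.3/0.1 abort-ratio bounds are illustrative, not taken from the paper): the monitored signal is the transactional abort ratio, and the actuator is the number of active threads.

```python
def control_step(nthreads, abort_ratio, high=0.3, low=0.1, max_threads=16):
    """One iteration of a hypothetical feedback loop: shrink parallelism
    when transactional aborts dominate, grow it when contention is low."""
    if abort_ratio > high and nthreads > 1:
        return nthreads - 1          # too many conflicts: back off
    if abort_ratio < low and nthreads < max_threads:
        return nthreads + 1          # little contention: add a thread
    return nthreads                  # within the dead zone: hold steady

# Simulated monitoring samples of the abort ratio over successive periods.
samples = [0.05, 0.08, 0.45, 0.50, 0.25, 0.02]
n = 4
history = []
for r in samples:
    n = control_step(n, r)
    history.append(n)
print(history)
```

The dead zone between the two thresholds keeps the controller from oscillating on every sample, one of the "well-mastered behaviours" the autonomic framing asks for.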

    Faithful Performance Prediction of a Dynamic Task-Based Runtime System for Heterogeneous Multi-Core Architectures

    Multi-core architectures comprising several GPUs have become mainstream in the field of High-Performance Computing. However, obtaining the maximum performance of such heterogeneous machines is challenging, as it requires carefully offloading computations and managing data movements between the different processing units. The most promising and successful approaches so far build on task-based runtimes that abstract the machine and rely on opportunistic scheduling algorithms. As a consequence, the problem shifts to choosing the task granularity and task graph structure, and to optimizing the scheduling strategies. Trying different combinations of these alternatives is itself a challenge: getting accurate measurements requires reserving the target system for the whole duration of the experiments, and observations are limited to the few systems at hand and may be difficult to generalize. In this article, we show how we crafted a coarse-grain hybrid simulation/emulation of StarPU, a dynamic runtime for hybrid architectures, on top of SimGrid, a versatile simulator of distributed systems. This approach yields performance predictions of classical dense linear algebra kernels that are accurate to within a few percent and obtained in a matter of seconds, allowing both runtime and application designers to quickly decide which optimizations to enable or whether it is worth investing in higher-end GPUs. It also allows conducting robust and extensive scheduling studies in a controlled environment whose characteristics are very close to real platforms while exhibiting reproducible behavior.
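The kind of prediction involved can be miniaturized: a few lines of simulation stand in for hours of machine reservation. The sketch below is a toy greedy earliest-finish-time scheduler over a hypothetical one-CPU/one-GPU platform with made-up speed ratios, not StarPU's scheduler or SimGrid's actual performance models.

```python
def simulate(tasks, pu_speeds):
    """Greedy earliest-finish-time scheduling: each task goes to the
    processing unit that would complete it soonest, given per-PU speeds."""
    ready_at = [0.0] * len(pu_speeds)   # time at which each PU becomes free
    for work in tasks:
        finish = [ready_at[p] + work / pu_speeds[p]
                  for p in range(len(pu_speeds))]
        best = min(range(len(pu_speeds)), key=lambda p: finish[p])
        ready_at[best] = finish[best]
    return max(ready_at)                # predicted makespan

# Hypothetical platform: one CPU core (speed 1) and one GPU that runs
# these kernels 8x faster; ten identical tasks of cost 4.
predicted = simulate([4.0] * 10, [1.0, 8.0])
print(predicted)
```

Even this toy model reproduces the qualitative insight: the slow CPU still picks up a task once the fast GPU's queue grows long enough, which is the kind of scheduling trade-off a faithful simulator lets designers explore without touching the real machine.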

    Control of Autonomic Parallelism Adaptation on Software Transactional Memory

    Parallel programs need to manage the trade-off between the time spent in synchronization and computation. High parallelism may decrease computing time while increasing synchronization cost among threads. One way to improve program performance is to adjust parallelism so as to balance conflicts among threads. However, there is no universal rule for deciding the best parallelism for a program from an offline view, and offline tuning is error-prone. Hence, it becomes necessary to adopt a dynamic tuning strategy to better manage a Software Transactional Memory (STM) system. STM has emerged as a promising technique that bypasses locks to address synchronization issues through transactions. Autonomic computing offers designers a framework of methods and techniques to build automated systems with well-mastered behaviours. Its key idea is to implement feedback control loops to design safe, efficient and predictable controllers, which enable monitoring and adjusting controlled systems dynamically while keeping overhead low. We propose to design feedback control loops to automate the choice of parallelism level at runtime and diminish program execution time.